The concepts, methods and techniques learned in class to solve real world problem using visual analytics techniques.
Mini-Challenge 1 looks at the relationships and conditions that led up to the kidnapping. As an analyst, you have a set of current and historical news reports at your disposal, as well as resumes of numerous GAStech employees and email headers from two weeks of internal GAStech company email. Can you identify the complex relationships among all of these people and organizations?
Note: This scenario and all the people, places, groups, technologies, contained therein are fictitious. Any resemblance to real people, places, groups, or technologies is purely coincidental.
In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.
In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearances.
It is January 21, 2014, and as an expert in visual analytics, you have been tasked with helping people understand the complex relationships among people and organizations that may have contributed to these events.
Mini-Challenge 1 looks at the relationships and conditions that led up to the kidnapping. As an analyst, you have a set of current and historical news reports at your disposal, as well as resumes of numerous GAStech employees and email headers from two weeks of internal GAStech company email. Can you identify the complex relationships among all of these people and organizations?
Use visual analytics to analyze the available data and develop responses to the questions below. In addition, prepare a video that shows how you used visual analytics to solve the challenge.
Characterize the news data sources provided. Which are primary sources and which are derivative sources? What are the relationships between the primary and derivative sources? Please limit your answer to 8 images and 300 words.
Characterize any biases you identify in these news sources, with respect to their representation of specific people, places, and events. Give examples. Please limit your answer to 6 images and 500 words.
Given the data sources provided, use visual analytics to identify potential official and unofficial relationships among GASTech, POK, the APA, and Government. Include both personal relationships and shared goals and objectives. Provide evidence for these relationships. Please limit your answer to 6 images and 400 words.
packages = c('igraph','tidygraph',
'ggraph','visNetwork',
'lubridate','clock',
'tidyverse',"tm",
'tidytext',
'widyr', 'wordcloud',
'DT', 'ggwordcloud',
'textplot','hms',
'tidygraph', 'ggraph',
'igraph','flipTime')
for(p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
GAStech_nodes <- read_csv("data source/MC1/EmployeeRecords.csv")
GAStech_edges <- read_csv("data source/MC1/email headers.csv")
GAStech_nodes$Name <- paste(GAStech_nodes$FirstName,GAStech_nodes$LastName)
GAStech_nodes$Name <- gsub(" ",".",GAStech_nodes$Name)
GAStech_nodes$id <- c(1:nrow(GAStech_nodes))
GAStech_nodes$military[GAStech_nodes$MilitaryDischargeDate != "NA"] <- "YES"
GAStech_nodes$military[is.na(GAStech_nodes$military)] <- "NO"
GAStech_nodes$BirthDate = mdy(GAStech_nodes$BirthDate)
GAStech_nodes$CitizenshipStartDate = mdy(GAStech_nodes$CitizenshipStartDate)
GAStech_nodes$PassportIssueDate = mdy(GAStech_nodes$PassportIssueDate)
GAStech_nodes$PassportExpirationDate = mdy(GAStech_nodes$PassportExpirationDate)
GAStech_nodes$CurrentEmploymentStartDate = mdy(GAStech_nodes$CurrentEmploymentStartDate)
GAStech_nodes$MilitaryDischargeDate = mdy(GAStech_nodes$MilitaryDischargeDate)
GAStech_edges_new <- GAStech_edges %>%
separate_rows(To)
GAStech_edges_new$From <- removeWords(GAStech_edges_new$From, "@gastech.com.kronos")
GAStech_edges_new$Date <- mdy_hm(GAStech_edges_new$Date)
GAStech_edges_new$Weekday <- wday(GAStech_edges_new$Date,
label = TRUE,
abbr = FALSE)
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$From != "gastech.com.kronos",]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To != "gastech.com.kronos",]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$From != "@gastech.com.tethys",]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To != "@gastech.com.tethys",]
GAStech_edges_new$From <- tolower(GAStech_edges_new$From)
GAStech_edges_new$To <- tolower(GAStech_edges_new$To)
GAStech_edges_new$From[GAStech_edges_new$From == "sten.sanjorge jr."] <- "sten.sanjorge.jr"
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To != GAStech_edges_new$From,]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$Weekday != "NA",]
vc <- unique(GAStech_nodes_new$Name)
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$From %in% vc,]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To %in% vc,]
nodes_ref <- GAStech_nodes_new[,1:2]
colnames(nodes_ref) <- c("Source", "From")
GAStech_edges_new1 <- merge(GAStech_edges_new, nodes_ref[, c("Source", "From")], by="From")
colnames(nodes_ref) <- c("Target", "To")
GAStech_edges_new1 <- merge(GAStech_edges_new1, nodes_ref[, c("Target", "To")], by="To")
GAStech_edges_aggregated <- GAStech_edges_new1 %>%
group_by(Source, Target, Weekday) %>%
summarise(Weight = n()) %>%
filter(Weight > 1) %>%
ungroup()
GAStech_graph <- tbl_graph(nodes = GAStech_nodes_new,
edges = GAStech_edges_aggregated,
directed = TRUE)
GAStech_graph %>%
activate(edges) %>%
arrange(desc(Weight))
# A tbl_graph: 54 nodes and 1813 edges
#
# A directed multigraph with 2 components
#
# Edge Data: 1,813 x 4 (active)
from to Weekday Weight
<int> <int> <ord> <int>
1 40 41 Tuesday 23
2 40 43 Tuesday 19
3 41 43 Tuesday 17
4 42 40 Tuesday 17
5 41 40 Tuesday 16
6 42 41 Tuesday 15
# ... with 1,807 more rows
#
# Node Data: 54 x 21
id Name EmailAddress CurrentEmployme~ CurrentEmployme~
<int> <chr> <chr> <chr> <chr>
1 1 mat.~ Mat.Bramar@~ Administration Assistant to CEO
2 2 anda~ Anda.Ribera~ Administration Assistant to CFO
3 3 rach~ Rachel.Pant~ Administration Assistant to CIO
# ... with 51 more rows, and 16 more variables: BirthCountry <chr>,
# Gender <chr>, CitizenshipCountry <chr>, CitizenshipBasis <chr>,
# CitizenshipStartDate <date>, military <chr>, LastName <chr>,
# FirstName <chr>, BirthDate <date>, PassportCountry <chr>,
# PassportIssueDate <date>, PassportExpirationDate <date>,
# CurrentEmploymentStartDate <date>, MilitaryServiceBranch <chr>,
# MilitaryDischargeType <chr>, MilitaryDischargeDate <date>
all_news <- "data source/MC1/News Articles/"
read_folder <- function(infolder) {
tibble(file = dir(infolder,
full.names = TRUE)) %>%
mutate(text = map(file,
read_lines)) %>%
transmute(id = basename(file),
text) %>%
unnest(text)
}
raw_text <- tibble(folder =
dir(all_news,
full.names = TRUE)) %>%
mutate(folder_out = map(folder,
read_folder)) %>%
unnest(cols = c(folder_out)) %>%
transmute(newsgroup = basename(folder),
id, text)
write_rds(raw_text, "data source/MC1/rds/all_news.rds")
raw_text_backup <- raw_text
raw_text1<-raw_text[!(raw_text$text==""|raw_text$text=="<< Continue reading the main story >>"),]
raw_text1<-raw_text1[!(raw_text1$text==" "|raw_text1$text=="<< to continue reading main history >>"),]
raw_text1<-raw_text1[!(raw_text1$text=="<< to continue reading main history >>"),]
cleaned_text <- raw_text1 %>%
filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
| text == "",
!str_detect(text,
"writes(:|\\.\\.\\.)$"),
!str_detect(text,
"^In article <"),
!str_detect(text,
"<<"),
!str_detect(text,
"To continue reading main history"),
!str_detect(text,
" to continue reading main history "),
!str_detect(text,
"continue to read the principal history"),
!str_detect(text,
"SOURCE:")
)
time_text <- raw_text1 %>%
filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
| text == "",
str_detect(text,
"PUBLISHED:")
)
time_text1 <- time_text
time_text1$text <- removeWords(time_text1$text, "PUBLISHED:")
time_text1$text <- removeWords(time_text1$text, "PUBLISHED: ")
time_text1$text1 <- dmy(time_text$text)
time_text1$text1 <- format(as.Date(time_text1$text1),'%d/%m/%Y')
time_text1$text2 <- ymd(time_text$text)
time_text1$text2 <- format(as.Date(time_text1$text2),'%d/%m/%Y')
time_text1$Time <- ifelse(is.na(time_text1$text1),
time_text1$text2,time_text1$text1)
time_text1 <- time_text1[,c(1,2,6)]
time_text1 <- time_text1[complete.cases(time_text1), ]
title_text <- raw_text1 %>%
filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
| text == "",
str_detect(text,
"TITLE:")
)
title_text$text <- removeWords(title_text$text, "TITLE: ")
colnames(title_text) <- c("News Source", "id", "Title")
cleaned_text <- cleaned_text %>%
filter(!str_detect(text,
"PUBLISHED:")
)
## display the tale with time
title_new <- merge(title_text, time_text1[, c("id", "Time")], by="id")
usenet_words <- cleaned_text %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$"),
!word %in% stop_words$word)
words_by_newsgroup <- usenet_words %>%
count(newsgroup, word, sort = TRUE) %>%
ungroup()
The similarity can be visualize by checking the most frequent words in each of the data source. In order to find out the correlation between different news sources, we firstly compute the tf-idf within each data source.
Can see directly from the top 12 highest tf-idf words of each group that, some groups share similar keywords. For example “candy”, appears in four sources:“News Online Today”, “The Continent”, “International Times”, “World Source”. From the DT table below we can type in “candy” and see the relative tf-idf values. Also, each source of news have certain preference of focus,therefore, for each topic the news, the primary source are more likely comes from the source with higher tf-idf scores.
wordcloud(words_by_newsgroup$word,
words_by_newsgroup$n,
max.words = 300)

raw_text %>%
group_by(newsgroup) %>%
summarize(messages = n_distinct(id)) %>%
ggplot(aes(messages,newsgroup)) +
geom_col(fill = "lightblue") +
labs(y = NULL)

tf_idf <- words_by_newsgroup %>%
bind_tf_idf(word, newsgroup, n) %>%
arrange(desc(tf_idf))
tf_idf %>%
group_by(newsgroup) %>%
filter(newsgroup == "Centrum Sentinel"|newsgroup == "Modern Rubicon"|
newsgroup == "Tethys News"|newsgroup == "News Online Today"|
newsgroup == "Kronos Star"|newsgroup == "Central Bulletin") %>%
slice_max(tf_idf,
n = 12) %>%
ungroup() %>%
mutate(word = reorder(word,
tf_idf)) %>%
ggplot(aes(tf_idf,
word,
fill = newsgroup)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup,
scales = "free") +
labs(x = "tf-idf",
y = NULL)

tf_idf %>%
group_by(newsgroup) %>%
filter(newsgroup == "The Guide"|newsgroup == "Worldwise"|
newsgroup == "Homeland Illumination"|newsgroup == "Athena Speaks"|
newsgroup == "The Explainer"|newsgroup == "Daily Pegasus") %>%
slice_max(tf_idf,
n = 12) %>%
ungroup() %>%
mutate(word = reorder(word,
tf_idf)) %>%
ggplot(aes(tf_idf,
word,
fill = newsgroup)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup,
scales = "free") +
labs(x = "tf-idf",
y = NULL)

tf_idf %>%
group_by(newsgroup) %>%
filter(newsgroup == "The Orb"|newsgroup == "The Abila Post"|
newsgroup == "World Source"|newsgroup == "News Desk"|
newsgroup == "Everyday News"|newsgroup == "The Continent") %>%
slice_max(tf_idf,
n = 12) %>%
ungroup() %>%
mutate(word = reorder(word,
tf_idf)) %>%
ggplot(aes(tf_idf,
word,
fill = newsgroup)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup,
scales = "free") +
labs(x = "tf-idf",
y = NULL)

tf_idf %>%
group_by(newsgroup) %>%
filter(newsgroup == "The General Post"|newsgroup == "The Tulip"|
newsgroup == "The Wrap"|newsgroup == "World Journal"|
newsgroup == "The Truth"|newsgroup == "The Light of Truth") %>%
slice_max(tf_idf,
n = 12) %>%
ungroup() %>%
mutate(word = reorder(word,
tf_idf)) %>%
ggplot(aes(tf_idf,
word,
fill = newsgroup)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup,
scales = "free") +
labs(x = "tf-idf",
y = NULL)

tf_idf %>%
group_by(newsgroup) %>%
filter(newsgroup == "International Times"|newsgroup == "Who What News"|
newsgroup == "The World"|newsgroup == "All News Today"|
newsgroup == "International News") %>%
slice_max(tf_idf,
n = 12) %>%
ungroup() %>%
mutate(word = reorder(word,
tf_idf)) %>%
ggplot(aes(tf_idf,
word,
fill = newsgroup)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup,
scales = "free") +
labs(x = "tf-idf",
y = NULL)

Also, word clouds are presented to better computing the main topic from each of media.
set.seed(1234)
words_by_newsgroup %>%
filter(n > 15) %>%
ggplot(aes(label = word,
size = n)) +
geom_text_wordcloud() +
theme_minimal() +
facet_wrap(~newsgroup)

The following DT table allows us to select text regarding to time, see the News sources and the title of News, can scroll and type in key words to select intended news and find out the primary source and secondary source reagrding to ksy words.
DT::datatable(title_new)
newsgroup_cors <- words_by_newsgroup %>%
pairwise_cor(newsgroup,
word,
n,
sort = TRUE)
The following DT table present the correlations between each of the News Media(New Source), and higher the correlations are, the more likely that two news media are sharing similar news are focusing on similar area of news. Also, they may be Primary source and Secondary source for each other, when combining with the title-time DT table can demonstrate better output.
By setting and increasing the threshold of correlation, a step-by-step community detection using visualization if performed. And can see from the graph that for a threshold of 0.7, two main communities can be distinguished, and medias from the same community are more correlated than other medias from other community.
set.seed(2017)
newsgroup_cors %>%
filter(correlation > .7) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation,
width = correlation)) +
geom_node_point(size = 6,
color = "lightblue") +
geom_node_text(aes(label = name),
color = "red",
repel = TRUE) +
theme_void()

Three communities are separated for a threshold of 0.75.
set.seed(2017)
newsgroup_cors %>%
filter(correlation > .75) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation,
width = correlation)) +
geom_node_point(size = 6,
color = "lightblue") +
geom_node_text(aes(label = name),
color = "red",
repel = TRUE) +
theme_void()

And five communities are separated for a threshold of 0.8.
set.seed(2017)
newsgroup_cors %>%
filter(correlation > .8) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation,
width = correlation)) +
geom_node_point(size = 6,
color = "lightblue") +
geom_node_text(aes(label = name),
color = "red",
repel = TRUE) +
theme_void()

All_text <- merge(cleaned_text, title_text[,c("id","Title")], by="id")
All_text <- merge(All_text, time_text1[, c("id", "Time")], by="id")
All_text$id <- iconv(All_text$id, to = "utf-8")
All_text$newsgroup <- iconv(All_text$newsgroup, to = "utf-8")
All_text$text <- iconv(All_text$text, to = "utf-8")
All_text$Title <- iconv(All_text$Title, to = "utf-8")
All_text$Time <- iconv(All_text$Time, to = "utf-8")
DT::datatable(All_text)

Also, Juliana Vann,a young girl died due to the environment pollution, share the same family name with two of the male employees of Gastech.


Notice the high frequency in News that the name of Juliana Vann appears.

Also, one of the emplyee is the member of POK as well as the older brother of Juliana Vann, as is shown in the picture attached.

And it is noticeable that high frequency of the family name “Vann” appears in the news.

A word “bend” constantly appears in the News, this could be related to Tiskele Bend fields, which is the main focus of POK targeting to environment protection, as well as the main business and source of income for GAStech.

GAStech_nodes_new <- GAStech_nodes_new %>%
rename(group = CurrentEmploymentType)
GAStech_nodes_new$title <- GAStech_nodes_new$Name
GAStech_nodes_new$label <- GAStech_nodes_new$Name
In the given interactive network graph, we are able to drag each node and see the email subject that related to the two persons. Each of the Node are colored by the department this person belongs to. And the color of the edges can help us to distinguish which department the email is sent by. Therefore we need to focus on the edges(emails) that colored difference from the target nodes, those are emails communicating between different departments.
And if we hover the mouse onto the edges, we are able to distinguish the exact subject of sent emails. Also, since we’ve already suspect the two employees from security department whose famaly names “Vann”, we are able to zoom in and have a closer look of their email exchanges with others.
visNetwork(GAStech_nodes_new,
GAStech_edges_aggregated2) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(arrows = "to",
smooth = list(enabled = TRUE,
type = "curvedCW")) %>%
visLegend() %>%
visLayout(randomSeed = 123)
visNetwork(GAStech_nodes_new,
GAStech_edges_aggregated2) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(highlightNearest = TRUE,
nodesIdSelection = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)
The general approach of consider all the emails between difference departments are personal can be analysed more accurate. Since the edges will only present one text for connection between two nodes, and the weight of edges shows the frequency of emailing, an approachable way is to search in the below DT table below and connect check all the emails to distinguish whether emails are offical or personal.
GAStech_edges_aggregated1$From <- iconv(GAStech_edges_aggregated1$From, to = "utf-8")
GAStech_edges_aggregated1$To <- iconv(GAStech_edges_aggregated1$To, to = "utf-8")
GAStech_edges_aggregated1$Date <- iconv(GAStech_edges_aggregated1$Date, to = "utf-8")
GAStech_edges_aggregated1$Subject <- iconv(GAStech_edges_aggregated1$Subject, to = "utf-8")
GAStech_edges_aggregated1$Weekday <- iconv(GAStech_edges_aggregated1$Weekday, to = "utf-8")
Node_present <- GAStech_edges_aggregated1[,c('From', 'To', 'Date','Subject','Weekday')]
DT::datatable(Node_present, filter = 'top') %>%
formatRound(columns = c('From', 'To',
'Date','Subject','Weekday'),
digits = 3) %>%
formatStyle(0,
target = 'row',
lineHeight='25%')
https://datastorm-open.github.io/visNetwork/edges.html